knitr::opts_chunk$set(warning=FALSE)

Needed libraries

library(dplyr)
library(countrycode)
library(outliers)
library(caret)
library(cluster)
library(factoextra)
library(NbClust)
library(rpart.plot) # prp(), used to plot the rpart trees below
library(RWeka)      # provides J48 (C4.5), used via caret
library(C50)        # provides C5.0, used via caret
library(DMwR)       # provides SMOTE(); archived on CRAN, so it may need to be installed from the archive

Phase 1

Problem statement

Prediction of cyber security employees’ salaries based on 11 attributes

1. work_year

2. experience_level

3. employment_type

4. job_title

5. salary

6. salary_currency

7. salary_in_usd

8. employee_residence

9. remote_ratio

10. company_location

11. company_size

Problem description

We are living in the “information age”, or rather the “data age”: everything around us revolves around data. Data has become one of the most valuable assets a person or an organisation can have; because it carries significant value, losing it leads to significant damage. Consequently, most attacks nowadays are directed toward data. To guard against such damage, organisations have realised the importance of protecting their digital assets, leading them to hire cybersecurity specialists. This has made cybersecurity popular, and there is a growing tendency to study it, which in turn has produced plentiful professionals with various experience levels and skills. As a result, organisations may find it difficult to decide on a salary for a job candidate based solely on the CV. Moreover, since attacks evolve rapidly, organisations will need to hire more employees in the future to defend against them, yet predicting future payroll is not easy, which may hinder some of an organisation’s plans. Another issue arises when the decision makers in an organisation are not fully aware of salary trends: this lack of awareness gives competitor organisations a chance to attract their employees away by offering better salaries that match current trends.

Data mining task

Prediction of the cyber security employees’ salary categories (Very Low, Low, Medium, High, Very High) using classification methods.

Goal

Given the problems discussed above, and in order to better understand this field, we decided to analyse a dataset of 1247 cybersecurity employees containing information such as salary, job title, and experience level. Analysing this dataset can provide insightful predictions regarding the salary range of a cybersecurity employee, which can help in:

  • Making better decisions
  • Making recruitment and hiring process easier and more efficient
  • Predicting the future payroll
  • Increasing loyalty
  • Increasing the satisfaction rate
  • Achieving fairness

Source of data:

https://www.kaggle.com/datasets/deepcontractor/cyber-security-salaries

Reading and viewing dataset

dataset= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/salaries_cyber.csv"), header=TRUE)
View(dataset)

Original dataset

We will keep a copy of the original dataset, before any preprocessing, to use if needed at any time.

originalDataset= dataset

General information about the dataset:

No. of attributes: 11
Type of attributes: Ordinal, Nominal, and Numeric
No. of objects: 1247
Class label: salary_in_usd

ncol(dataset)
nrow(dataset)
names(dataset)
str(dataset)

Attributes’ description table

work_year: the year in which the salary was recorded. Numerical; 2020 to 2022.

experience_level: expertise level of the employee. Ordinal; EN “Entry level”, MI “Mid level”, SE “Senior level”, EX “Executive level”.

employment_type: the nature or category of the employee’s engagement in the job. Nominal; PT “Part time”, FT “Full time”, CT “Contract”, FL “Freelancer”.

job_title: the role worked in during the year. Nominal; various titles, e.g. Security Analyst, Security Researcher.

salary: the total gross salary amount paid. Numerical; 1740 to 50001566.

salary_currency: the currency of the salary paid to the employee, as an ISO 4217 currency code. Nominal; e.g. EUR, CAD.

salary_in_usd: the salary paid, in United States dollars. Numerical; 2000 to 365596.40.

employee_residence: the employee’s primary country of residence. Nominal; country codes, e.g. US, AE.

remote_ratio: percentage of remote work by the employee in the specified year. Numerical; 0 “No remote work”, 50 “Partially remote”, 100 “Fully remote”.

company_location: the country of the employer’s main office. Nominal; country codes, e.g. BR, BW.

company_size: the size of the company. Ordinal; S, M, or L.

Phase 2

Sample of 20 employees from the dataset:

using the sample_n(table, size) function, with set.seed() for reproducibility

set.seed(30)
sample=sample_n(dataset,20)
print(sample)

Show the missing values:

If a cell is FALSE there is no null value; if it is TRUE there is a null value. In our dataset there are no null values.

is.na(dataset)
sum(is.na(dataset))

Show the Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for each numeric column

summary(dataset$work_year)
summary(dataset$salary)
summary(dataset$salary_in_usd)
summary(dataset$remote_ratio)

Show the variance of each numeric column

var(dataset$work_year)
var(dataset$salary)
var(dataset$salary_in_usd)
var(dataset$remote_ratio)

Visualization of relationships between some pairs of attributes:

Here we used a boxplot to see the distribution of salary_in_usd across experience_level. We observed that salaries vary depending on the level of experience; the two are positively correlated.

boxplot(salary_in_usd ~ experience_level, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)

Here we used a boxplot to see the distribution of salary_in_usd across work_year. We observed that the 2021 salaries were close to each other, but in 2022 the gap between them grew.

boxplot(salary_in_usd ~ work_year, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)

Here we used a boxplot to see the distribution of salary_in_usd across employment_type. We observed that Full Time (FT) offers higher salaries than the other categories.

boxplot(salary_in_usd ~ employment_type, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)

Here we used a boxplot to see the distribution of salary_in_usd across company_size. We observed that the larger the company, the higher the salary.

boxplot(salary_in_usd ~ company_size, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999) 

Data Reduction

Dimensionality Reduction

The “salary” column gives the same information as “salary_in_usd”; the difference is only the currency, and we would eventually have to convert all values in the “salary” column to one common currency anyway. To further confirm that the two columns are redundant, we will use recent USD exchange rates for the relevant currencies.

We will start by creating a temporary column named “converted_salary” that stores the result of converting salary_in_usd back to each row’s original currency using the exchange rate, for comparison with the “salary” column.

convertedDataset=dataset


# USD value of one unit of each currency (approximate, point-in-time rates)
rates <- c(USD = 1, BRL = 0.20, GBP = 1.22, EUR = 1.06, INR = 0.012,
           CAD = 0.74, CHF = 1.10, DKK = 0.14, SGD = 0.73, AUD = 0.64,
           SEK = 0.090, MXN = 0.057, ILS = 0.26, PLN = 0.23, NOK = 0.093,
           IDR = 0.000065, NZD = 0.60, HUF = 0.0027, ZAR = 0.053,
           TWD = 0.031, RUB = 0.010)
convertedDataset$exchange_rate = unname(1 / rates[as.character(convertedDataset$salary_currency)])
convertedDataset$converted_salary = convertedDataset$salary_in_usd * convertedDataset$exchange_rate



set.seed(1)
salary_sample <- sample_n(convertedDataset[,c("salary","converted_salary")],10)

print(salary_sample)

As shown in the sample, the two columns are almost identical. This can be confirmed by the correlation coefficient as well.

correlation <- cor(convertedDataset$salary , convertedDataset$converted_salary)
print(correlation)

The correlation is very high but has not reached 100%, possibly due to rounding in the calculations and slight changes in the exchange rates over time.

To make the mining process more efficient and improve its quality, we decided to remove the “salary” column.

dataset <- dataset[, -which(names(dataset) == "salary")]

Find the outliers and remove them:

We will show outliers with boxplots and then remove them, to minimize noise and obtain better analytical results when applying data mining techniques.

Now we show the outliers of the salary_in_usd attribute. There are many outliers with exceptionally high values, so we remove the observations flagged by the outlier() function from the outliers package (which flags the values most distant from the mean).

boxplot(dataset$salary_in_usd)



OutSalary = outlier(dataset$salary_in_usd, logical =TRUE)
Find_outlier = which(OutSalary ==TRUE, arr.ind = TRUE)
dataset= dataset[-Find_outlier,]

Now we show the outliers of the remote_ratio attribute. The boxplot shows none, so we can skip the removal step for this attribute.

boxplot(dataset$remote_ratio)

Now we show the outliers of the work_year attribute. Again the boxplot shows none, so no rows need to be removed.

boxplot(dataset$work_year)

Concept hierarchy generation for nominal data

The columns “company_location” and “employee_residence” hold the country of the company and of the employee, respectively. These attributes can be generalized to a higher-level concept, region, to help understand and analyse the dataset better and improve algorithm performance.

We will use the 7 regions as defined in the World Bank Development Indicators. These regions are:

  1. East Asia and Pacific: This region includes countries like China, Australia, Indonesia, Thailand, etc.

  2. Europe and Central Asia: This region includes countries like Germany, UK, Russia, Turkey, etc.

  3. Latin America & Caribbean: This region includes countries like Brazil, Mexico, Argentina, Cuba, etc.

  4. Middle East and North Africa: This region includes countries like Saudi Arabia, Egypt, Iran, Iraq, etc.

  5. North America: This is predominantly United States and Canada.

  6. South Asia: This region includes countries like India, Pakistan, Bangladesh, Sri Lanka, etc.

  7. Sub-Saharan Africa: This region includes countries like Nigeria, South Africa, Ethiopia, Kenya, etc.

Note: UM (the United States Minor Outlying Islands) and AQ (Antarctica) do not belong to any of these regions, so they will be kept as they are.



um_cl = which(dataset$company_location == "UM")
aq_cl = which(dataset$company_location == "AQ")
um_er = which(dataset$employee_residence == "UM")
aq_er = which(dataset$employee_residence == "AQ")


dataset$company_location <- countrycode(dataset$company_location, "iso2c", "region")
dataset$employee_residence <- countrycode(dataset$employee_residence, "iso2c", "region")

dataset[um_cl, "company_location"] = "UM"
dataset[aq_cl, "company_location"] = "AQ"
dataset[um_er, "employee_residence"] = "UM"
dataset[aq_er, "employee_residence"] = "AQ"

Concept hierarchy generation can be applied to “job_title” as well, to improve interpretation and scalability. Most job titles are essentially the same job under different names, so we can combine them into higher-level job titles such as Architect, Analyst, and Engineer.

## Create the categories based on job family; case_when evaluates the
## patterns in order, so a title matching "Analyst" is never re-tested
dataset$job_title <- case_when(
  grepl("Analyst", dataset$job_title) ~ "Analyst",
  grepl("Architect", dataset$job_title) ~ "Architect",
  grepl("Engineer", dataset$job_title) ~ "Engineer",
  grepl("Manager|Officer|Director|Leader", dataset$job_title) ~ "Leadership",
  grepl("Consultant|Specialist", dataset$job_title) ~ "Consultant/Specialist",
  grepl("Cyber", dataset$job_title) ~ "Cyber Security",
  TRUE ~ "Others"
)

Encoding categorical data

To deal with character-type columns, we will encode them as factors, because most machine learning algorithms in R are designed to work with factor data rather than character data, and the encoding improves performance and interpretability as well.

dataset$job_title  <- factor(dataset$job_title)

dataset$experience_level = factor(dataset$experience_level, levels=c("EN", "MI", "SE", "EX"), labels=c(1,2,3,4))

dataset$employment_type  <- factor(dataset$employment_type)

dataset$employee_residence  <- factor(dataset$employee_residence)

dataset$company_location  <- factor(dataset$company_location)

dataset$salary_currency  <- factor(dataset$salary_currency)

dataset$company_size = factor(dataset$company_size, levels=c("S","M","L"), labels=c(1,2,3))

Discretization of the salary_in_usd attribute

by calculating breaks based on quantiles of its distribution

breaks <- quantile(dataset$salary_in_usd, 
                   probs = c(0, .25, .5, .75, .95, 1), 
                   na.rm = TRUE)


dataset$salary_in_usd <- cut(dataset$salary_in_usd, 
                                       breaks = breaks, 
                                       include.lowest = TRUE, 
                                       labels=c("Very Low", "Low", "Medium", "High", "Very High"))

Normalization:

to normalize the numeric attributes (remote_ratio and work_year) with z-score standardization (mean 0, standard deviation 1), so that they carry equal weight

dataset [, c("work_year" , "remote_ratio")] = scale(dataset [, c("work_year" , "remote_ratio")])
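A quick sanity check of what scale() does, on hypothetical toy values:

```r
# scale() performs z-score standardization: subtract the mean, divide by the sd
v <- c(2020, 2020, 2021, 2022)
z <- scale(v)
mean(z)   # ~0
sd(z)     # 1
```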

Feature Selection

We will apply feature selection to remove redundant or irrelevant attributes from the dataset, aiming for the smallest subset that yields the most accurate predictions for our target class (salary_in_usd) while decreasing the time the classifier needs to process the data.

We will use RFE (Recursive Feature Elimination), a wrapper method for feature selection. Since the rfe function has multiple control options, we need to specify the ones we want. We chose “Random Forest” functions because random forests have high accuracy and can handle categorical data.

control <- rfeControl(functions = rfFuncs, 
                      method = "repeatedcv",
                      repeats = 5, 
                      number = 10)

First, we save the features to be used in the feature selection (every attribute except the class label “salary_in_usd”) in variable x, and the class label in variable y. Then we split the data into 80% training and 20% test.

x <- dataset %>%
  select(-salary_in_usd) %>%
  as.data.frame()

# Target variable
y <- dataset$salary_in_usd

# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = .80, list = FALSE)[,1]

x_train <- x[ inTrain, ]
x_test  <- x[-inTrain, ]

y_train <- y[ inTrain]
y_test  <- y[-inTrain]

After splitting the data, we can now perform the selection using rfe:

set.seed(1)
result_rfe1 <- rfe(x = x_train, 
                   y = y_train, 
                   sizes = c(1:9),
                   rfeControl = control)

result_rfe1

predictors(result_rfe1)

The results show that all attributes except “employment_type” are selected. This is logical, as 98% of the rows have the value “FT”, as shown in the table below. Due to this low variance, we decided to remove the attribute.

table(dataset$employment_type)
dataset<-dataset[,-which(names(dataset)=="employment_type")]

Phase 3

dataset2= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/preprocessedDataset.csv"), header=TRUE)


char_vars <- sapply(dataset2, is.character)
dataset2[char_vars] <- lapply(dataset2[char_vars], as.factor)

Balancing the data

To resolve the class imbalance problem in the dataset, we will use the SMOTE() method (from the DMwR package), which oversamples the minority classes by creating synthetic samples from the existing minority-class samples.

balanced_dataset <- SMOTE(salary_in_usd ~ ., dataset2, perc.over = 300, perc.under = 500, k = 10)
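The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is a random point on the segment between a minority observation and one of its k nearest minority neighbours. This is only a toy sketch with hypothetical numeric features; the real SMOTE() also performs the nearest-neighbour search and undersamples the majority class.

```r
set.seed(42)
x        <- c(1.0, 2.0)   # a minority-class observation (toy features)
neighbor <- c(2.0, 4.0)   # one of its nearest minority neighbours
gap      <- runif(1)      # random position along the segment [0, 1]
synthetic <- x + gap * (neighbor - x)
synthetic                 # lies between x and neighbor in every feature
```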

Classification

The goal of all preceding steps is to properly prepare the dataset for the classification phase, which constitutes one of our primary mining objectives. In this section, we will employ various attribute selection methods such as the Gini index, Gain ratio, and information gain to construct a decision tree model. We will thoroughly evaluate its performance, and if it proves effective, it can subsequently be utilized to classify new instances with unknown class labels.

Since our dataset is small, we decided to use k-fold cross-validation. For each attribute selection method we will try different numbers of folds (10, 5, and 3).

The following function will be used to compute the macro-averaged sensitivity and specificity:



macro = function(cm){
  # cm is a caret confusionMatrix object; its byClass matrix holds the
  # per-class Sensitivity and Specificity for the 5 salary classes,
  # which we macro-average (simple mean over classes).
  avgSen  = mean(cm$byClass[, "Sensitivity"])
  avgSpec = mean(cm$byClass[, "Specificity"])

  data.frame(Sensitivity = avgSen,
             Specificity = avgSpec,
             Accuracy    = unname(cm$overall["Accuracy"]))
}

Gini index

The Gini index measures the impurity of a dataset. The partitioning that yields the largest reduction in impurity is selected as the split. To apply the Gini index, we use the “rpart” method, which employs the Gini index as its splitting criterion.
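For intuition, the Gini impurity of a single node can be computed directly. This is a minimal sketch with hypothetical toy labels; rpart performs the equivalent computation internally for every candidate split.

```r
# Gini impurity of a node: 1 - sum over classes k of p_k^2
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions in the node
  1 - sum(p^2)
}

gini(c("a", "a", "b", "b"))   # 0.5: maximally impure for two classes
gini(c("a", "a", "a", "a"))   # 0: a pure node
```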

10 Folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")

giniIndex10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart",trControl = ctrl)

prp(giniIndex10$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)


caret::confusionMatrix(giniIndex10$pred$obs,giniIndex10$pred$pred)
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High         0  29    122        18        5
  Low          0 133     67         1       69
  Medium       0  71    134         9       23
  Very_High    0   5     98       146        3
  Very_Low     0  93     52         6      113

Overall Statistics
                                          
               Accuracy : 0.4394          
                 95% CI : (0.4111, 0.4681)
    No Information Rate : 0.3952          
    P-Value [Acc > NIR] : 0.001007        
                                          
                  Kappa : 0.2891          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity                   NA     0.4018        0.2833           0.8111
Specificity               0.8546     0.8418        0.8577           0.8958
Pos Pred Value                NA     0.4926        0.5654           0.5794
Neg Pred Value                NA     0.7864        0.6469           0.9640
Prevalence                0.0000     0.2765        0.3952           0.1504
Detection Rate            0.0000     0.1111        0.1119           0.1220
Detection Prevalence      0.1454     0.2256        0.1980           0.2105
Balanced Accuracy             NA     0.6218        0.5705           0.8534
                     Class: Very_Low
Sensitivity                   0.5305
Specificity                   0.8465
Pos Pred Value                0.4280
Neg Pred Value                0.8928
Prevalence                    0.1779
Detection Rate                0.0944
Detection Prevalence          0.2206
Balanced Accuracy             0.6885
5 Folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")

giniIndex5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart",trControl = ctrl)

prp(giniIndex5$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)


caret::confusionMatrix(giniIndex5$pred$obs,giniIndex5$pred$pred)
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High         0  23     99        41       11
  Low          0 118     51        17       84
  Medium       0  66    107        36       28
  Very_High    0   4     82       162        4
  Very_Low     0  85     39        19      121

Overall Statistics
                                         
               Accuracy : 0.4244         
                 95% CI : (0.3962, 0.453)
    No Information Rate : 0.3158         
    P-Value [Acc > NIR] : 1.978e-15      
                                         
                  Kappa : 0.2692         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity                   NA    0.39865       0.28307           0.5891
Specificity               0.8546    0.83130       0.84127           0.9024
Pos Pred Value                NA    0.43704       0.45148           0.6429
Neg Pred Value                NA    0.80798       0.71771           0.8804
Prevalence                0.0000    0.24728       0.31579           0.2297
Detection Rate            0.0000    0.09858       0.08939           0.1353
Detection Prevalence      0.1454    0.22556       0.19799           0.2105
Balanced Accuracy             NA    0.61497       0.56217           0.7457
                     Class: Very_Low
Sensitivity                   0.4879
Specificity                   0.8493
Pos Pred Value                0.4583
Neg Pred Value                0.8639
Prevalence                    0.2072
Detection Rate                0.1011
Detection Prevalence          0.2206
Balanced Accuracy             0.6686
3 Folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")

giniIndex3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart",trControl = ctrl)

prp(giniIndex3$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)


caret::confusionMatrix(giniIndex3$pred$obs,giniIndex3$pred$pred)
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High         0  30     76        62        6
  Low          0 161     33        28       48
  Medium       0  86     80        60       11
  Very_High    0   6     59       176       11
  Very_Low     0 103     20        27      114

Overall Statistics
                                          
               Accuracy : 0.4436          
                 95% CI : (0.4152, 0.4723)
    No Information Rate : 0.3225          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.292           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity                   NA     0.4171       0.29851           0.4986
Specificity               0.8546     0.8656       0.83100           0.9100
Pos Pred Value                NA     0.5963       0.33755           0.6984
Neg Pred Value                NA     0.7573       0.80417           0.8127
Prevalence                0.0000     0.3225       0.22389           0.2949
Detection Rate            0.0000     0.1345       0.06683           0.1470
Detection Prevalence      0.1454     0.2256       0.19799           0.2105
Balanced Accuracy             NA     0.6413       0.56475           0.7043
                     Class: Very_Low
Sensitivity                  0.60000
Specificity                  0.85104
Pos Pred Value               0.43182
Neg Pred Value               0.91854
Prevalence                   0.15873
Detection Rate               0.09524
Detection Prevalence         0.22055
Balanced Accuracy            0.72552

Gain ratio

The gain ratio, a normalized measure of information gain, is calculated by dividing information gain by the split information. The attribute that yields the highest gain ratio is chosen as the splitting attribute. The C4.5 algorithm employs the gain ratio.

J48 is the Java-based open-source implementation of the C4.5 algorithm, included in the Weka package (and available in R through RWeka). This implementation allows users to conveniently apply the C4.5 decision tree.
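The quantities involved can be illustrated on a tiny example (hypothetical toy data; J48 computes these internally for every candidate split):

```r
# Shannon entropy in bits (assumes every observed class has count > 0)
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

y      <- c("hi", "hi", "lo", "lo")   # class labels
splitA <- c("x",  "x",  "y",  "y")    # candidate splitting attribute

# Expected entropy after the split, weighted by branch size
cond_entropy <- sum(sapply(split(y, splitA),
                           function(s) length(s) / length(y) * entropy(s)))

info_gain  <- entropy(y) - cond_entropy   # 1 bit: the split is perfect
split_info <- entropy(splitA)             # 1 bit: two equal-sized branches
gain_ratio <- info_gain / split_info      # 1
```

Dividing by the split information penalises attributes with many small branches, which plain information gain would otherwise favour.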

10 Folds

set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
gainRatio10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio10$finalModel)

gainRatio10cm = caret::confusionMatrix(gainRatio10$pred$obs, gainRatio10$pred$pred)

gainRatio10cm
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High       100  18     32        21        3
  Low         28 148     40         3       51
  Medium      53  49    115        14        6
  Very_High   28   6     18       194        6
  Very_Low     2  39     11         5      207

Overall Statistics
                                          
               Accuracy : 0.6383          
                 95% CI : (0.6103, 0.6655)
    No Information Rate : 0.2281          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.5465          
                                          
 Mcnemar's Test P-Value : 0.167           

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity              0.47393     0.5692       0.53241           0.8186
Specificity              0.92495     0.8698       0.87564           0.9396
Pos Pred Value           0.57471     0.5481       0.48523           0.7698
Neg Pred Value           0.89150     0.8792       0.89479           0.9545
Prevalence               0.17627     0.2172       0.18045           0.1980
Detection Rate           0.08354     0.1236       0.09607           0.1621
Detection Prevalence     0.14536     0.2256       0.19799           0.2105
Balanced Accuracy        0.69944     0.7195       0.70402           0.8791
                     Class: Very_Low
Sensitivity                   0.7582
Specificity                   0.9383
Pos Pred Value                0.7841
Neg Pred Value                0.9293
Prevalence                    0.2281
Detection Rate                0.1729
Detection Prevalence          0.2206
Balanced Accuracy             0.8483

5 Folds

set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
gainRatio5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio5$finalModel)


gainRatio5cm=caret::confusionMatrix(gainRatio5$pred$obs, gainRatio5$pred$pred)

gainRatio5cm
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High       102  21     34        16        1
  Low         31 148     38         1       52
  Medium      56  46    103        15       17
  Very_High   30   7     18       194        3
  Very_Low     3  42     10         9      200

Overall Statistics
                                          
               Accuracy : 0.6241          
                 95% CI : (0.5959, 0.6516)
    No Information Rate : 0.2281          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5289          
                                          
 Mcnemar's Test P-Value : 0.007667        

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity              0.45946     0.5606       0.50739           0.8255
Specificity              0.92615     0.8692       0.86519           0.9397
Pos Pred Value           0.58621     0.5481       0.43460           0.7698
Neg Pred Value           0.88270     0.8749       0.89583           0.9566
Prevalence               0.18546     0.2206       0.16959           0.1963
Detection Rate           0.08521     0.1236       0.08605           0.1621
Detection Prevalence     0.14536     0.2256       0.19799           0.2105
Balanced Accuracy        0.69281     0.7149       0.68629           0.8826
                     Class: Very_Low
Sensitivity                   0.7326
Specificity                   0.9307
Pos Pred Value                0.7576
Neg Pred Value                0.9218
Prevalence                    0.2281
Detection Rate                0.1671
Detection Prevalence          0.2206
Balanced Accuracy             0.8317

3 Folds

set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
gainRatio3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio3$finalModel)

gainRatio3cm=caret::confusionMatrix(gainRatio3$pred$obs, gainRatio3$pred$pred)

gainRatio3cm
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High        94  18     39        19        4
  Low         25 129     39         3       74
  Medium      47  47    110        19       14
  Very_High   14   4     27       200        7
  Very_Low     3  32      9        12      208

Overall Statistics
                                          
               Accuracy : 0.619           
                 95% CI : (0.5909, 0.6467)
    No Information Rate : 0.2565          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5216          
                                          
 Mcnemar's Test P-Value : 0.007322        

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity              0.51366     0.5609        0.4911           0.7905
Specificity              0.92110     0.8542        0.8695           0.9449
Pos Pred Value           0.54023     0.4778        0.4641           0.7937
Neg Pred Value           0.91300     0.8910        0.8812           0.9439
Prevalence               0.15288     0.1921        0.1871           0.2114
Detection Rate           0.07853     0.1078        0.0919           0.1671
Detection Prevalence     0.14536     0.2256        0.1980           0.2105
Balanced Accuracy        0.71738     0.7075        0.6803           0.8677
                     Class: Very_Low
Sensitivity                   0.6775
Specificity                   0.9371
Pos Pred Value                0.7879
Neg Pred Value                0.8939
Prevalence                    0.2565
Detection Rate                0.1738
Detection Prevalence          0.2206
Balanced Accuracy             0.8073

Analysis of the gain ratio classification

All three trees appear to share the same structure:

The attribute selected first, at the root, is experience_level. It splits the tree into a right subtree (SE “Senior level”, EX “Executive level”) and a left subtree (EN “Entry level”, MI “Mid level”).

Each of these subtrees further refines the classification based on the attribute employee_residence, but with different split criteria:

In the right subtree, the split tests whether the tuple has the value “Latin America & Caribbean”.

In the left subtree, if the experience level is 1 (EN), the tree partitions further on whether the tuple has the value “North America”; if the experience level is 2 (MI), the split tests whether employee_residence is “Latin America & Caribbean”.

rbind("10 Folds"=macro(gainRatio10cm), "5 Folds"=macro(gainRatio5cm), "3 Folds"=macro(gainRatio3cm)  )

Based on the evaluation metrics of Sensitivity, Specificity, and Accuracy, it is evident that the gain ratio model, built using a 10-fold cross-validation approach, exhibits superior performance compared to the other two models. However, it’s worth noting that the difference in performance between the models is relatively small. Notably, as the number of folds decreases, a corresponding decline in the model’s performance becomes apparent.

Information gain

10 Folds

set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")


infoGain10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
c5model <- C5.0(salary_in_usd ~ .,
                       data = data_balanced,
                       trials = infoGain10$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain10$bestTune$winnow))
plot(c5model)

caret::confusionMatrix(infoGain10$pred$obs, infoGain10$pred$pred)
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High        90  19     40        20        5
  Low         24 127     42        10       67
  Medium      51  58     96        19       13
  Very_High   17   3     15       208        9
  Very_Low     3  43      4        10      204

Overall Statistics
                                          
               Accuracy : 0.6057          
                 95% CI : (0.5773, 0.6335)
    No Information Rate : 0.249           
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.5046          
                                          
 Mcnemar's Test P-Value : 0.03427         

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity              0.48649     0.5080        0.4873           0.7790
Specificity              0.91700     0.8490        0.8590           0.9527
Pos Pred Value           0.51724     0.4704        0.4051           0.8254
Neg Pred Value           0.90714     0.8673        0.8948           0.9376
Prevalence               0.15455     0.2089        0.1646           0.2231
Detection Rate           0.07519     0.1061        0.0802           0.1738
Detection Prevalence     0.14536     0.2256        0.1980           0.2105
Balanced Accuracy        0.70174     0.6785        0.6732           0.8659
                     Class: Very_Low
Sensitivity                   0.6846
Specificity                   0.9333
Pos Pred Value                0.7727
Neg Pred Value                0.8992
Prevalence                    0.2490
Detection Rate                0.1704
Detection Prevalence          0.2206
Balanced Accuracy             0.8089

5 Folds

set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")


infoGain5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
c5model <- C5.0(salary_in_usd ~ .,
                       data = data_balanced,
                       trials = infoGain5$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain5$bestTune$winnow))
plot(c5model)

caret::confusionMatrix(infoGain5$pred$obs, infoGain5$pred$pred)
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High        85  20     47        16        6
  Low         25 129     38        10       68
  Medium      37  70     99        14       17
  Very_High   16   4     13       210        9
  Very_Low     2  37      9        13      203

Overall Statistics
                                          
               Accuracy : 0.6065          
                 95% CI : (0.5782, 0.6343)
    No Information Rate : 0.2531          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5049          
                                          
 Mcnemar's Test P-Value : 0.001691        

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity              0.51515     0.4962       0.48058           0.7985
Specificity              0.91376     0.8495       0.86075           0.9550
Pos Pred Value           0.48851     0.4778       0.41772           0.8333
Neg Pred Value           0.92180     0.8587       0.88854           0.9439
Prevalence               0.13784     0.2172       0.17210           0.2197
Detection Rate           0.07101     0.1078       0.08271           0.1754
Detection Prevalence     0.14536     0.2256       0.19799           0.2105
Balanced Accuracy        0.71446     0.6728       0.67066           0.8768
                     Class: Very_Low
Sensitivity                   0.6700
Specificity                   0.9318
Pos Pred Value                0.7689
Neg Pred Value                0.8928
Prevalence                    0.2531
Detection Rate                0.1696
Detection Prevalence          0.2206
Balanced Accuracy             0.8009

3 Folds

set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")


infoGain3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
c5model <- C5.0(salary_in_usd ~ .,
                       data = data_balanced,
                       trials = infoGain3$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain3$bestTune$winnow))
plot(c5model)

caret::confusionMatrix(infoGain3$pred$obs, infoGain3$pred$pred)
Confusion Matrix and Statistics

           Reference
Prediction  High Low Medium Very_High Very_Low
  High        85  25     40        19        5
  Low         19 137     37         5       72
  Medium      55  64     93        11       14
  Very_High   16   8     22       197        9
  Very_Low     4  43     11         9      197

Overall Statistics
                                          
               Accuracy : 0.5923          
                 95% CI : (0.5639, 0.6203)
    No Information Rate : 0.2481          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.4874          
                                          
 Mcnemar's Test P-Value : 0.01149         

Statistics by Class:

                     Class: High Class: Low Class: Medium Class: Very_High
Sensitivity              0.47486     0.4946       0.45813           0.8174
Specificity              0.91257     0.8554       0.85513           0.9425
Pos Pred Value           0.48851     0.5074       0.39241           0.7817
Neg Pred Value           0.90811     0.8490       0.88542           0.9534
Prevalence               0.14954     0.2314       0.16959           0.2013
Detection Rate           0.07101     0.1145       0.07769           0.1646
Detection Prevalence     0.14536     0.2256       0.19799           0.2105
Balanced Accuracy        0.69372     0.6750       0.65663           0.8799
                     Class: Very_Low
Sensitivity                   0.6633
Specificity                   0.9256
Pos Pred Value                0.7462
Neg Pred Value                0.8928
Prevalence                    0.2481
Detection Rate                0.1646
Detection Prevalence          0.2206
Balanced Accuracy             0.7944

Silhouette method

# a) fviz_nbclust() with the silhouette method, using library(factoextra)
library(factoextra)  # ensure factoextra is attached in this session
fviz_nbclust(dataset, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")
---
title: "Cybersecurity salaries"
output: html_notebook
---




### Goal

Given the problems discussed above, and in order to better understand this field, we decided to analyse a dataset of 1247 cybersecurity employees containing information such as salary, job title, and experience level. Analysing this dataset can provide insightful predictions regarding the salary range of a cybersecurity employee, which can help in:

-   Making better decisions
-   Making recruitment and hiring process easier and more efficient
-   Predicting the future payroll
-   Increasing loyalty
-   Increasing the satisfaction rate
-   Achieving fairness

## Source of data:

<https://www.kaggle.com/datasets/deepcontractor/cyber-security-salaries>

### Reading and viewing dataset

```{r}
dataset= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/salaries_cyber.csv"), header=TRUE)
View(dataset)

```

### Original dataset

We will keep a copy of the original dataset before data preprocessing, so it can be used if needed at any time.

```{r}
originalDataset= dataset
```

## General information about the dataset:

No. of attributes: 11\
Type of attributes: Ordinal , Nominal, and Numeric\
No. of objects: 1247\
Class label: salary_in_usd

```{r}
ncol(dataset)
nrow(dataset)
names(dataset)
str(dataset)
```

### Attributes' description table

+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| **Attribute Name** | **Description**                                             | **Data Type** | **Possible values**                                       |
+====================+=============================================================+===============+===========================================================+
| work_year          | The year in which salary was recorded                       | Numerical     | 2020 to 2022                                              |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| experience_level   | Expertise level of the employee                             | Ordinal       | En "Entry level"\                                         |
|                    |                                                             |               | MI "Mid level"\                                           |
|                    |                                                             |               | SE "Senior level"\                                        |
|                    |                                                             |               | EX "Executive level"                                      |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| employment_type    | The nature or category of employee's engagement in the job  | Nominal       | PT "Part time"\                                           |
|                    |                                                             |               | FT "Full time"\                                           |
|                    |                                                             |               | CT "Contract\                                             |
|                    |                                                             |               | FL"Freelancer"                                            |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| job_title          | The role worked in during the year                          | Nominal       | Different titles.                                         |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like Security Analyst, security researcher                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary             | The total gross salary amount paid                          | Numerical     | 1740-50001566                                             |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary_currency    | The currency of the salary paid to the employee             | Nominal       | Different currencies according to ISO 4217 currency code. |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like DE,CA                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary_in_usd      | The salary paid in United states dollar                     | Numerical     | 2000 to 365596.40                                         |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| employee_residence | Employee's primary country of residence                     | Nominal       | Different countries.                                      |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like US,AE                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| remote_ratio       | Percentage of online work by employee in the specified year | Numerical     | 0 "No remote work"\                                       |
|                    |                                                             |               | 50 "Partially remote"\                                    |
|                    |                                                             |               | 100 "Fully remote"                                        |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| company_location   | The country of the employer's main office                   | Nominal       | Different countries.                                      |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like BR,BW                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| company_size       | How big/small is the company                                | Ordinal       | S , M or L                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+

# phase 2

### sample of 20 employees from the dataset:

using the `sample_n(table, size)` function, with `set.seed()` for reproducibility

```{r}
set.seed(30)
sample=sample_n(dataset,20)
print(sample)
```

### Show the missing values:

If a value is FALSE there is no null value; if it is TRUE there is a null value. In our dataset there are no null values.

```{r}
is.na(dataset)
sum(is.na(dataset))
```

### Show the Min.,1st Qu.,Median,Mean ,3rd Qu.,Max. for each numeric column

```{r}
summary(dataset$work_year)
summary(dataset$salary)
summary(dataset$salary_in_usd)
summary(dataset$remote_ratio)
```

### Show the variance of each numeric column

```{r}
var(dataset$work_year)
var(dataset$salary)
var(dataset$salary_in_usd)
var(dataset$remote_ratio)
```

### Visualization of relationship between some pairs of attributes:

Here we used a boxplot to see the distribution of salary_in_usd across experience_level. We observed that salaries vary depending on the level of experience; the two are positively correlated.

```{r}
boxplot(salary_in_usd ~ experience_level, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
```

Here we used a boxplot to see the distribution of salary_in_usd across work_year. We observed that 2021 salaries were close to each other, but in 2022 the gap between them grew bigger.

```{r}
boxplot(salary_in_usd ~ work_year, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
```

Here we used a boxplot to see the distribution of salary_in_usd across employment_type. We observed that Full Time (FT) offers a higher salary than the other categories.

```{r}
boxplot(salary_in_usd ~ employment_type, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
```

Here we used a boxplot to see the distribution of salary_in_usd across company_size. We observed that the larger the company, the higher the salary.

```{r}
boxplot(salary_in_usd ~ company_size, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999) 
```

## Data Reduction

### Dimensionality Reduction

The "salary" column gives the same information as "salary_in_usd" it's just a matter of currency exchange, and we will eventually transform all the values in "salary" column to one common currency so we can properly deal with them. To further confirm that the two column are redundant, we will use the latest exchange rate for USD to the desired currency.

We will start by creating a temporary column named "converted_salary", which holds the result of converting salary_in_usd back to each row's original currency using the exchange rate, so that we can compare it with the "salary" column.

```{r}
convertedDataset=dataset


convertedDataset$exchange_rate = factor(convertedDataset$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1/1,1/0.20,1/1.22,1/1.06,1/0.012,1/0.74,1/1.10,1/0.14,1/0.73,1/0.64,1/0.090,1/0.057,1/0.26,1/0.23,1/0.093,1/0.000065,1/0.60,1/0.0027,1/0.053,1/0.031,1/0.010))
convertedDataset$exchange_rate = as.numeric(as.character(convertedDataset$exchange_rate))
convertedDataset$converted_salary = convertedDataset$salary_in_usd*convertedDataset$exchange_rate



set.seed(1)
salary_sample <- sample_n(convertedDataset[,c("salary","converted_salary")],10)

print(salary_sample)
```

As shown in the sample, the two columns are almost identical. This can be confirmed by the correlation coefficient as well.

```{r}
correlation <- cor(convertedDataset$salary , convertedDataset$converted_salary)
print(correlation)
```

The correlation is very high but does not reach exactly 1, possibly due to rounding in the calculations and slight changes in the exchange rates over time.

To make the mining process more efficient and improve its quality, we decided to remove the "salary" column.

```{r}
dataset<-dataset[,-c(5)]
```

### Find the outliers and remove them:

We will show outliers with boxplots and then remove them, to minimize noise and obtain better analytical results when applying data mining techniques.

Now we show the (salary_in_usd) attribute's outliers. We can see that there are outliers with exceptionally high values, so we will remove them. Note that `outlier()` from the outliers package flags only the observation(s) furthest from the mean, so this step removes the most extreme value(s) rather than every point drawn outside the boxplot whiskers.

```{r}
boxplot(dataset$salary_in_usd)

# outlier() flags the value(s) furthest from the mean
OutSalary = outlier(dataset$salary_in_usd, logical = TRUE)
# locate the flagged rows and drop them from the dataset
Find_outlier = which(OutSalary == TRUE, arr.ind = TRUE)
dataset = dataset[-Find_outlier, ]
```
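Since `outlier()` only flags the value(s) furthest from the mean, a stricter alternative would be the standard 1.5 x IQR rule, which removes every point outside the boxplot whiskers. A sketch (the `dataset_iqr` copy is hypothetical; the notebook itself uses `outlier()` above):

```r
# Sketch: remove all points outside the 1.5*IQR fences (the boxplot whiskers)
q <- quantile(dataset$salary_in_usd, probs = c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[2] - q[1])
keep <- dataset$salary_in_usd >= q[1] - fence &
        dataset$salary_in_usd <= q[2] + fence
dataset_iqr <- dataset[keep, ]  # hypothetical copy, kept separate from dataset
```

This trades a larger loss of rows for a cleaner salary distribution, so the choice depends on how aggressive the cleaning should be.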

Now we show the (remote_ratio) attribute's outliers. We can see there are no outliers in remote_ratio, so we do not need the removal step here.

```{r}
boxplot(dataset$remote_ratio)

```

Now we show the (work_year) attribute's outliers. We can see there are no outliers in work_year, so we do not need the removal step here either.

```{r}
boxplot(dataset$work_year)

```

### Concept hierarchy generation for nominal data

the columns "company_location" and "employee_residence" have the name of countries for the company and employee respectively. And these attributes can be generalized to higher-level concept that is region to help understand and analyze the dataset better and improve algorithm performance.

We will use the 7 regions as defined in the World Bank Development Indicators. These regions are:

1.  East Asia and Pacific: This region includes countries like China, Australia, Indonesia, Thailand, etc.

2.  Europe and Central Asia: This region includes countries like Germany, UK, Russia, Turkey, etc.

3.  Latin America & Caribbean: This region includes countries like Brazil, Mexico, Argentina, Cuba, etc.

4.  Middle East and North Africa: This region includes countries like Saudi Arabia, Egypt, Iran, Iraq, etc.

5.  North America: This is predominantly United States and Canada.

6.  South Asia: This region includes countries like India, Pakistan, Bangladesh, Sri Lanka, etc.

7.  Sub-Saharan Africa: This region includes countries like Nigeria, South Africa, Ethiopia, Kenya, etc.

Note: UM(The United States Minor Outlying Islands) and AQ(Antarctica) don't belong to any of these regions, thus, they will be used as they are.

```{r}


um=which(dataset$company_location=="UM")
aq=which(dataset$company_location=="AQ")


dataset$company_location <- countrycode(dataset$company_location, "iso2c", "region")
dataset$employee_residence <- countrycode(dataset$employee_residence, "iso2c", "region")

dataset[um,"company_location"]="UM"
dataset[aq,"company_location"]="AQ"

```

Concept hierarchy generation can be done for "job_title" as well, to improve interpretation and scalability. Most job titles are essentially the same job under different names, so we can combine them into higher-level job titles such as Architect, Analyst, Engineer, etc.

```{r}
## Create the categories based on job rank.
## case_when() from dplyr replaces the deeply nested ifelse calls;
## conditions are evaluated in order, so behaviour is unchanged.
dataset$job_title <- case_when(
  grepl("Analyst", dataset$job_title)                         ~ "Analyst",
  grepl("Architect", dataset$job_title)                       ~ "Architect",
  grepl("Engineer", dataset$job_title)                        ~ "Engineer",
  grepl("Manager|Officer|Director|Leader", dataset$job_title) ~ "Leadership",
  grepl("Consultant|Specialist", dataset$job_title)           ~ "Consultant/Specialist",
  grepl("Cyber", dataset$job_title)                           ~ "Cyber Security",
  TRUE                                                        ~ "Others"
)
```

## Encoding categorical data

To deal with character-type columns we are going to encode them as factors, because most machine learning algorithms in R are designed to work with factor data rather than character data, and this improves performance and interpretability as well.

```{r}
dataset$job_title <- factor(dataset$job_title)

dataset$experience_level = factor(dataset$experience_level, levels=c("EN", "MI", "SE", "EX"), labels=c(1,2,3,4))

dataset$employment_type <- factor(dataset$employment_type)

dataset$employee_residence <- factor(dataset$employee_residence)

dataset$company_location <- factor(dataset$company_location)

dataset$salary_currency <- factor(dataset$salary_currency)

dataset$company_size = factor(dataset$company_size, levels=c("S","M","L"), labels=c(1,2,3))
```

### Discretization of salary_in_usd attribute

by calculating breaks based on quantiles (the quartiles plus the 95th percentile, giving five categories)

```{r}
breaks <- quantile(dataset$salary_in_usd, 
                   probs = c(0, .25, .5, .75, .95, 1), 
                   na.rm = TRUE)


dataset$salary_in_usd <- cut(dataset$salary_in_usd, 
                                       breaks = breaks, 
                                       include.lowest = TRUE, 
                                       labels=c("Very Low", "Low", "Medium", "High", "Very High"))


```
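As a quick sanity check, the same quantile-based cut can be exercised on a toy vector (hypothetical salary values, not the project dataset) to see how observations distribute over the five bins:

```{r}
# Quantile-based discretization on 100 evenly spaced toy salaries
salaries <- seq(10000, 200000, length.out = 100)
brks <- quantile(salaries, probs = c(0, .25, .5, .75, .95, 1), na.rm = TRUE)
bins <- cut(salaries, breaks = brks, include.lowest = TRUE,
            labels = c("Very Low", "Low", "Medium", "High", "Very High"))
table(bins)  # 25 / 25 / 25 / 20 / 5: the top bin holds only the top 5%
```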

### Normalization

We standardize the numeric attributes (work_year and remote_ratio) with `scale()`, which applies z-score normalization: each value is centred by the column mean and divided by the column standard deviation, so both attributes end up on a comparable scale (mean 0, standard deviation 1) and receive equal weight.

```{r}
dataset [, c("work_year" , "remote_ratio")] = scale(dataset [, c("work_year" , "remote_ratio")])
```
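A minimal sketch of what `scale()` actually computes, on a toy vector (hypothetical values): each entry is centred by the mean and divided by the standard deviation.

```{r}
# z-score standardization by hand versus scale() (toy vector)
v <- c(2020, 2021, 2021, 2022)
z_manual <- (v - mean(v)) / sd(v)
z_scale  <- as.numeric(scale(v))
all.equal(z_manual, z_scale)  # TRUE: scale() is the z-score transform
```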

## Feature Selection

We will implement feature selection to remove redundant or irrelevant attributes from the dataset, aiming for the smallest subset that yields the most accurate predictions of the target class (salary_in_usd) while also reducing the time the classifier needs to process the data.

We will use RFE (Recursive Feature Elimination), a wrapper method for feature selection. Since the rfe() function offers multiple control options, we need to specify the ones we want: we choose random forest ("rfFuncs") because it achieves high accuracy and handles categorical data well.

```{r}
control <- rfeControl(functions = rfFuncs, 
                      method = "repeatedcv",
                      repeats = 5, 
                      number = 10)
```

First, we save the features to be used in the feature selection (every attribute except the class label "salary_in_usd") in variable x, and the class label in variable y. Then we split the data into 80% training and 20% test.

```{r}
x <- dataset %>%
  select(-salary_in_usd) %>%
  as.data.frame()

# Target variable
y <- dataset$salary_in_usd

# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = .80, list = FALSE)[,1]

x_train <- x[ inTrain, ]
x_test  <- x[-inTrain, ]

y_train <- y[ inTrain]
y_test  <- y[-inTrain]

```

After splitting the data, we can perform the selection using rfe().

```{r}
set.seed(1)
result_rfe1 <- rfe(x = x_train, 
                   y = y_train, 
                   sizes = c(1:9),
                   rfeControl = control)

result_rfe1

predictors(result_rfe1)

```

The results show that all the remaining attributes except "employment_type" are selected. This is logical, as 98% of the rows have the value "FT", as shown in the table below. Due to this low variance, we decided to remove the attribute.

```{r}
table(dataset$employment_type)
```

```{r}
dataset <- dataset[, -which(names(dataset) == "employment_type")]
```
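Low-variance attributes like this can also be flagged programmatically. A base-R sketch (the 95% threshold is a hypothetical choice for illustration): mark a categorical attribute for removal when its most frequent value covers almost all rows.

```{r}
# Flag a near-constant attribute: TRUE when the modal value exceeds the threshold
low_variance <- function(x, threshold = 0.95) {
  max(table(x)) / length(x) > threshold
}

# Toy vector mimicking employment_type: ~98% "FT"
employment_type <- c(rep("FT", 98), "PT", "CT")
low_variance(employment_type)  # TRUE: candidate for removal
```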


# phase 3
```{r}
# Reload the preprocessed dataset from the previous phase and restore factor types
dataset2 <- read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/preprocessedDataset.csv"), header = TRUE)

char_vars <- sapply(dataset2, is.character)
dataset2[char_vars] <- lapply(dataset2[char_vars], as.factor)
```






## Balancing data

To resolve the class imbalance in the dataset, we will use the SMOTE() method, which oversamples the minority class by creating synthetic samples from the existing minority class samples.

```{r}
library(DMwR)  # provides SMOTE()

# Oversample the minority classes (300%) and undersample the majority (500%)
# using the 10 nearest neighbours
data_balanced <- SMOTE(salary_in_usd ~ ., dataset2, perc.over = 300, perc.under = 500, k = 10)
```
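The core idea behind SMOTE can be sketched in base R: a synthetic sample is created by interpolating between a minority point and one of its k nearest minority neighbours (a simplified one-feature illustration, not the actual SMOTE() implementation):

```{r}
# SMOTE-style interpolation on a single numeric feature (toy data, k = 2)
set.seed(42)
minority <- c(10, 12, 15, 40)                   # minority-class feature values
x  <- minority[1]                               # point being oversampled
nn <- minority[order(abs(minority - x))][2:3]   # its 2 nearest neighbours
neighbour <- sample(nn, 1)                      # pick one neighbour at random
gap <- runif(1)                                 # random position on the segment
synthetic <- x + gap * (neighbour - x)          # new point lies between the two
synthetic
```

The synthetic value always falls on the segment between the original point and its neighbour, so new samples stay inside the minority region rather than being exact duplicates.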

## Classification

The goal of all preceding steps is to properly prepare the dataset for the classification phase, which constitutes one of our primary mining objectives. In this section, we will employ various attribute selection methods such as the Gini index, Gain ratio, and information gain to construct a decision tree model. We will thoroughly evaluate its performance, and if it proves effective, it can subsequently be utilized to classify new instances with unknown class labels.

Since our dataset is small, we decided to use k-fold cross-validation. For each attribute selection method we will try different fold counts (10, 5, and 3).
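The fold assignment behind k-fold cross-validation can be sketched in base R (a hypothetical 10-row example; caret performs this internally through trainControl()):

```{r}
# Assign each of n rows to one of k folds; each fold acts once as the test
# set while the remaining k - 1 folds form the training set.
set.seed(10)
n <- 10
k <- 5
folds <- sample(rep(1:k, length.out = n))  # shuffled fold labels
split(seq_len(n), folds)                   # row indices held out per fold
```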



The following function will be used to compute the macro-averaged sensitivity and specificity:
```{r}
# Macro-average the per-class sensitivity and specificity of a caret
# confusionMatrix object (5 classes) and report them with overall accuracy
macro <- function(cm) {
  avgSen  <- mean(cm$byClass[, "Sensitivity"])
  avgSpec <- mean(cm$byClass[, "Specificity"])
  data.frame(Sensitivity = avgSen,
             Specificity = avgSpec,
             Accuracy    = unname(cm$overall["Accuracy"]))
}
```


### Gini index

The Gini index measures the impurity of the dataset; the partitioning that yields the greatest reduction in impurity is selected as the split. To apply the Gini index, we use the "rpart" method, which uses it as its splitting criterion.
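As a worked illustration (toy class counts, independent of rpart's internals), the Gini index of a node with class proportions p equals 1 - sum(p^2): a pure node scores 0, and a node split evenly over our five salary classes scores 1 - 5 * (1/5)^2 = 0.8.

```{r}
# Gini impurity of a node, computed from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(50, 0, 0, 0, 0))      # pure node -> 0
gini(c(10, 10, 10, 10, 10))  # evenly mixed five classes -> 0.8
```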

#### 10 Folds

```{r}
set.seed(10)
library(rpart.plot)  # provides prp() for plotting rpart trees

ctrl <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = "final")

giniIndex10 <- train(salary_in_usd ~ ., data = data_balanced, method = "rpart", trControl = ctrl)

prp(giniIndex10$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
```

```{r}
caret::confusionMatrix(data = giniIndex10$pred$pred, reference = giniIndex10$pred$obs)
```

#### 5 Folds

```{r}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp = "all", savePredictions = "final")

giniIndex5 <- train(salary_in_usd ~ ., data = data_balanced, method = "rpart", trControl = ctrl)

prp(giniIndex5$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
```

```{r}
caret::confusionMatrix(data = giniIndex5$pred$pred, reference = giniIndex5$pred$obs)
```

#### 3 Folds

```{r}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp = "all", savePredictions = "final")

giniIndex3 <- train(salary_in_usd ~ ., data = data_balanced, method = "rpart", trControl = ctrl)

prp(giniIndex3$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
```

```{r}
caret::confusionMatrix(data = giniIndex3$pred$pred, reference = giniIndex3$pred$obs)
```




### Gain ratio


The gain ratio, a normalized measure of information gain, is calculated by dividing the information gain by the split information. The attribute that yields the highest gain ratio is chosen as the splitting attribute. The C4.5 algorithm employs the gain ratio.

J48 is the open-source, Java-based implementation of the C4.5 algorithm included in the Weka package, which allows users to conveniently apply C4.5 decision trees.
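A toy computation of the gain ratio (hypothetical counts, purely illustrative): information gain is the parent's entropy minus the weighted entropy of the children, and it is normalized by the split information, the entropy of the split sizes themselves.

```{r}
# Entropy (in bits) of a class distribution given as counts
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Parent node: 8 vs 6; a candidate split produces two children
parent   <- c(8, 6)
children <- list(c(6, 1), c(2, 5))

weights    <- sapply(children, sum) / sum(parent)
info_gain  <- entropy(parent) - sum(weights * sapply(children, entropy))
split_info <- entropy(sapply(children, sum))  # entropy of the split sizes
gain_ratio <- info_gain / split_info
gain_ratio  # about 0.258 for these counts
```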



#### 10 Folds

```{r , fig.height=70, fig.width=90}
set.seed(10)
library(RWeka)  # provides the J48 implementation used by caret

ctrl <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = "final")
gainRatio10 <- train(salary_in_usd ~ ., data = data_balanced, method = "J48", trControl = ctrl)
plot(gainRatio10$finalModel)
```





```{r}
gainRatio10cm <- caret::confusionMatrix(data = gainRatio10$pred$pred, reference = gainRatio10$pred$obs)
gainRatio10cm
```








#### 5 Folds

```{r , fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp = "all", savePredictions = "final")
gainRatio5 <- train(salary_in_usd ~ ., data = data_balanced, method = "J48", trControl = ctrl)
plot(gainRatio5$finalModel)
```





```{r}
gainRatio5cm <- caret::confusionMatrix(data = gainRatio5$pred$pred, reference = gainRatio5$pred$obs)
gainRatio5cm
```



#### 3 Folds

```{r, fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp = "all", savePredictions = "final")
gainRatio3 <- train(salary_in_usd ~ ., data = data_balanced, method = "J48", trControl = ctrl)
plot(gainRatio3$finalModel)
```


```{r}
gainRatio3cm <- caret::confusionMatrix(data = gainRatio3$pred$pred, reference = gainRatio3$pred$obs)
gainRatio3cm
```





### Analysis of the gain ratio classification

All three trees share the same structure. The attribute selected at the root is the experience level, which divides the tree into:

- Right subtree: SE (Senior level) and EX (Executive level)
- Left subtree: EN (Entry level) and MI (Mid level)

Each subtree further refines the classification based on the "employee_residence" attribute, but with different splitting criteria:

- In the right subtree, the split tests whether the tuple has the value "Latin America & Caribbean".
- In the left subtree, if the experience level is 1, the tree splits on whether the tuple has the value "North America"; if the experience level is 2, the split tests whether "employee_residence" is "Latin America & Caribbean".



```{r}
rbind("10 Folds"=macro(gainRatio10cm), "5 Folds"=macro(gainRatio5cm), "3 Folds"=macro(gainRatio3cm)  )
```



Based on the evaluation metrics of Sensitivity, Specificity, and Accuracy, it is evident that the gain ratio model, built using a 10-fold cross-validation approach, exhibits superior performance compared to the other two models. However, it's worth noting that the difference in performance between the models is relatively small. Notably, as the number of folds decreases, a corresponding decline in the model's performance becomes apparent.









### Information gain

Information gain measures the expected reduction in entropy achieved by partitioning the data on an attribute; the attribute that yields the highest gain is chosen as the splitting attribute. To apply it, we use the "C5.0" method.

#### 10 Folds

```{r}
set.seed(10)
library(C50)  # provides C5.0() for refitting and plotting the final model

ctrl <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = "final")

infoGain10 <- train(salary_in_usd ~ ., data = data_balanced, method = "C5.0", trControl = ctrl)

# Refit the best tuning configuration as a plottable C5.0 tree
c5model <- C5.0(salary_in_usd ~ .,
                data = data_balanced,
                trials = infoGain10$bestTune$trials,
                rules = FALSE,
                control = C5.0Control(winnow = infoGain10$bestTune$winnow))
plot(c5model)
```

```{r}
caret::confusionMatrix(data = infoGain10$pred$pred, reference = infoGain10$pred$obs)
```

#### 5 Folds

```{r}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp = "all", savePredictions = "final")

infoGain5 <- train(salary_in_usd ~ ., data = data_balanced, method = "C5.0", trControl = ctrl)

c5model <- C5.0(salary_in_usd ~ .,
                data = data_balanced,
                trials = infoGain5$bestTune$trials,
                rules = FALSE,
                control = C5.0Control(winnow = infoGain5$bestTune$winnow))
plot(c5model)
```

```{r}
caret::confusionMatrix(data = infoGain5$pred$pred, reference = infoGain5$pred$obs)
```

#### 3 Folds

```{r}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp = "all", savePredictions = "final")

infoGain3 <- train(salary_in_usd ~ ., data = data_balanced, method = "C5.0", trControl = ctrl)

c5model <- C5.0(salary_in_usd ~ .,
                data = data_balanced,
                trials = infoGain3$bestTune$trials,
                rules = FALSE,
                control = C5.0Control(winnow = infoGain3$bestTune$winnow))
plot(c5model)
```

```{r}
caret::confusionMatrix(data = infoGain3$pred$pred, reference = infoGain3$pred$obs)
```




### Silhouette method

The silhouette method measures how similar each point is to its own cluster compared with the nearest neighbouring cluster; averaging this over all points for different values of k suggests the optimal number of clusters for k-means.

```{r}
# (a) fviz_nbclust() with the silhouette method, from the factoextra package
fviz_nbclust(dataset, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

# (b) NbClust validation across 2-10 clusters with Euclidean distance
fres.nbclust <- NbClust(dataset, distance = "euclidean", min.nc = 2, max.nc = 10,
                        method = "kmeans", index = "all")
```
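For intuition, the silhouette width that both functions rely on can be computed by hand: for each point, s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster (a base-R sketch on toy one-dimensional data, independent of factoextra and NbClust):

```{r}
# Average silhouette width of a toy two-cluster assignment
x  <- c(1, 2, 3, 10, 11, 12)   # two clearly separated groups
cl <- c(1, 1, 1, 2, 2, 2)      # cluster labels
d  <- as.matrix(dist(x))       # pairwise distance matrix

sil <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  a <- mean(d[i, own & seq_along(x) != i])  # mean intra-cluster distance
  b <- mean(d[i, !own])                     # mean distance to the other cluster
  (b - a) / max(a, b)
})
mean(sil)  # close to 1, indicating well-separated clusters
```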

